A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration

Authors

  • Bo Zhao
  • Benjamin I. P. Rubinstein
  • Jim Gemmell
  • Jiawei Han
Abstract

In practical data integration systems, it is common for the data sources being integrated to provide conflicting information about the same entity. Consequently, a major challenge for data integration is to derive the most complete and accurate integrated records from diverse and sometimes conflicting sources. We term this challenge the truth finding problem. We observe that some sources are generally more reliable than others, and therefore a good model of source quality is the key to solving the truth finding problem. In this work, we propose a probabilistic graphical model that can automatically infer true records and source quality without any supervision. In contrast to previous methods, our principled approach leverages a generative process of two types of errors (false positive and false negative) by modeling two different aspects of source quality. In so doing, ours is also the first approach designed to merge multi-valued attribute types. Our method is scalable, due to an efficient sampling-based inference algorithm that needs very few iterations in practice and enjoys linear time complexity, with an even faster incremental variant. Experiments on two real-world datasets show that our new method outperforms existing state-of-the-art approaches to the truth finding problem.
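To make the two-sided notion of source quality concrete, here is a minimal sketch in Python. It is not the paper's inference algorithm: the abstract describes unsupervised, sampling-based inference over a graphical model, whereas this sketch assumes each source's sensitivity (one minus its false negative rate) and specificity (one minus its false positive rate) are already known and simply applies Bayes' rule to one candidate fact. The function name `posterior_true`, the source names, and all numeric values are hypothetical and chosen only for illustration.

```python
def posterior_true(claims, sensitivity, specificity, prior=0.5):
    """Posterior probability that one candidate fact is true.

    claims:      dict mapping source name -> True if the source asserts
                 the fact, False if it covers the entity but omits the fact.
    sensitivity: dict mapping source name -> P(source asserts | fact is true).
    specificity: dict mapping source name -> P(source omits | fact is false).
    prior:       prior probability that the fact is true (assumed value).
    Sources are treated as conditionally independent given the truth,
    which is a simplifying assumption of this sketch.
    """
    like_true = 1.0   # likelihood of the observed claims if the fact is true
    like_false = 1.0  # likelihood of the observed claims if the fact is false
    for source, asserted in claims.items():
        sens = sensitivity[source]
        spec = specificity[source]
        if asserted:
            like_true *= sens          # true positive
            like_false *= 1.0 - spec   # false positive
        else:
            like_true *= 1.0 - sens    # false negative
            like_false *= spec         # true negative
    numerator = prior * like_true
    return numerator / (numerator + (1.0 - prior) * like_false)


# Hypothetical multi-valued attribute: "X is an author of book B".
# source_a lists X; source_b covers the book but omits X.
sensitivity = {"source_a": 0.9, "source_b": 0.5}
specificity = {"source_a": 0.95, "source_b": 0.99}
claims = {"source_a": True, "source_b": False}

p = posterior_true(claims, sensitivity, specificity)
print(f"P(fact is true) = {p:.3f}")
```

Under these assumed numbers the omission by `source_b` counts only weakly against the fact, because that source has low sensitivity and often leaves out true values. A single accuracy score per source could not separate this case from a source that frequently asserts false values, which is why the two error types are modeled separately.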

Similar resources

Title: A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration Conference: VLDB 2012

Truth discovery is an interesting problem in data integration. In practical data integration systems, it is common for the data sources being integrated to provide conflicting information about the same entity, which raises the truth finding problem. The authors propose a Bayesian approach, the latent truth model, to solve the problem. The authors also conduct experiments regarding both effe...

Integrating Conflicting Data: The Role of Source Dependence

Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values and different sources can often provide conflicting values. To present quality data to users, it is critical that data integration systems can resolve conf...

A Probabilistic Model for Estimating Real-valued Truth from Conflicting Sources

One important task in data integration is to identify truth from noisy and conflicting data records collected from multiple sources, i.e., the truth finding problem. Previously, several methods have been proposed to solve this problem by simultaneously learning the quality of sources and the truth. However, all those methods are mainly designed for handling categorical data but not numerical da...

Domain-Aware Multi-Truth Discovery from Conflicting Sources

In the Big Data era, truth discovery has served as a promising technique to resolve conflicts in the facts provided by numerous data sources. The most significant challenge for this task is to estimate source reliability and select the answers supported by high-quality sources. However, existing works assume that one data source has the same reliability on any kind of entity, ignoring the possib...

Joint Bayesian Stochastic Inversion of Well Logs and Seismic Data for Volumetric Uncertainty Analysis

Herein, an application of a new seismic inversion algorithm in one of Iran’s oilfields is described. Stochastic (geostatistical) seismic inversion, as a complementary method to deterministic inversion, is perceived as a combination of geostatistics and seismic inversion algorithms. This method integrates information from different data sources with different scales, as prior informat...

Journal:
  • PVLDB

Volume 5

Publication date: 2012